Summarize the cleaned data¶
Tables of descriptive statistics
The feature summaries displayed in earlier notebooks are no longer valid due to the data-cleaning performed in those notebooks. This notebook displays feature summaries for the cleaned data.
-
As a complement to the summary tables, the notebook also plots the distribution of values for each feature (excluding categorical features that have a large number of distinct values).
from pathlib import Path
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from IPython.display import SVG, display
from matplotlib.ticker import FuncFormatter, MaxNLocator, PercentFormatter
from pandas import Categorical, Series, to_datetime
from pandas.api.types import is_integer_dtype, is_numeric_dtype, is_string_dtype
from notebook_tools.data_cleaning import (
convert_acc_loan_data,
filter_acc_loan_data,
load_acc_loan_data,
load_acc_loan_feat_desc,
)
from notebook_tools.feature_exploration import (
get_group_sizes,
style_loan_summary,
summarize_acc_loans,
)
from notebook_tools.plots import format_counts, get_wrapped_caption
loan_data = load_acc_loan_data().pipe(convert_acc_loan_data).pipe(filter_acc_loan_data)
feat_desc = load_acc_loan_feat_desc()
Tables of descriptive statistics¶
print(f"\n\nThe number of records is {len(loan_data.index):,d}.")
for dtype in [np.number, "string", "boolean"]:
summary = summarize_acc_loans(loan_data, dtype, feat_desc)
print(f"\n\nThe number of features of type {dtype} is {len(summary.index)}.\n\n")
display(style_loan_summary(summary))
The number of records is 2,217,507. The number of features of type <class 'numpy.number'> is 110.
| count | mean | std | min | 25% | 50% | 75% | max | data type | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| loan_amnt | 2,217,507 | 15,122.2 | 9,204.2 | 1,000.0 | 8,000.0 | 13,000.0 | 20,000.0 | 40,000.0 | Float64 | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. |
| funded_amnt | 2,217,507 | 15,122.0 | 9,204.1 | 1,000.0 | 8,000.0 | 13,000.0 | 20,000.0 | 40,000.0 | Float64 | The total amount committed to that loan at that point in time. |
| funded_amnt_inv | 2,217,507 | 15,116.5 | 9,201.9 | 725.0 | 8,000.0 | 13,000.0 | 20,000.0 | 40,000.0 | Float64 | The total amount committed by investors for that loan at that point in time. |
| term | 2,217,507 | 42.9 | 10.9 | 36.0 | 36.0 | 36.0 | 60.0 | 60.0 | Int64 | The number of payments on the loan. Values are in months and can be either 36 or 60. |
| int_rate | 2,217,507 | 13.1 | 4.8 | 5.3 | 9.5 | 12.6 | 16.0 | 31.0 | Float64 | Interest Rate on the loan |
| installment | 2,217,507 | 448.2 | 267.6 | 7.6 | 254.0 | 380.7 | 595.5 | 1,719.8 | Float64 | The monthly payment owed by the borrower if the loan originates. |
| annual_inc | 2,217,507 | 78,163.7 | 113,432.4 | 0.0 | 46,000.0 | 65,000.0 | 93,500.0 | 110,000,000.0 | Float64 | The self-reported annual income provided by the borrower during registration. |
| dti | 2,215,797 | 18.9 | 14.3 | -1.0 | 12.0 | 17.9 | 24.6 | 999.0 | Float64 | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| delinq_2yrs | 2,217,507 | 0.3 | 0.9 | 0.0 | 0.0 | 0.0 | 0.0 | 58.0 | Int64 | The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years |
| fico_range_low | 2,217,507 | 698.3 | 32.9 | 660.0 | 675.0 | 690.0 | 715.0 | 845.0 | Int64 | The lower boundary range the borrower’s FICO at loan origination belongs to. |
| fico_range_high | 2,217,507 | 702.3 | 32.9 | 664.0 | 679.0 | 694.0 | 719.0 | 850.0 | Int64 | The upper boundary range the borrower’s FICO at loan origination belongs to. |
| inq_last_6mths | 2,217,506 | 0.6 | 0.9 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 | Int64 | The number of inquiries in past 6 months (excluding auto and mortgage inquiries) |
| mths_since_last_delinq | 1,086,217 | 34.5 | 21.9 | 0.0 | 16.0 | 31.0 | 50.0 | 226.0 | Int64 | The number of months since the borrower's last delinquency. |
| mths_since_last_record | 355,456 | 72.4 | 26.1 | 0.0 | 55.0 | 74.0 | 92.0 | 127.0 | Int64 | The number of months since the last public record. |
| open_acc | 2,217,507 | 11.7 | 5.7 | 0.0 | 8.0 | 11.0 | 14.0 | 101.0 | Int64 | The number of open credit lines in the borrower's credit file. |
| pub_rec | 2,217,507 | 0.2 | 0.6 | 0.0 | 0.0 | 0.0 | 0.0 | 86.0 | Int64 | Number of derogatory public records |
| revol_bal | 2,217,507 | 16,703.6 | 22,964.0 | 0.0 | 5,995.0 | 11,367.0 | 20,300.0 | 2,904,836.0 | Float64 | Total credit revolving balance |
| revol_util | 2,215,795 | 50.4 | 24.6 | 0.0 | 31.6 | 50.3 | 69.3 | 892.3 | Float64 | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. |
| total_acc | 2,217,507 | 24.2 | 12.0 | 2.0 | 15.0 | 22.0 | 31.0 | 176.0 | Int64 | The total number of credit lines currently in the borrower's credit file |
| out_prncp | 2,217,507 | 4,287.5 | 7,390.0 | 0.0 | 0.0 | 0.0 | 6,367.8 | 40,000.0 | Float64 | Remaining outstanding principal for total amount funded |
| out_prncp_inv | 2,217,507 | 4,286.5 | 7,389.1 | 0.0 | 0.0 | 0.0 | 6,365.0 | 40,000.0 | Float64 | Remaining outstanding principal for portion of total amount funded by investors |
| total_pymnt | 2,217,507 | 12,082.8 | 9,915.8 | 0.0 | 4,528.1 | 9,320.4 | 16,951.7 | 63,296.9 | Float64 | Payments received to date for total amount funded |
| total_pymnt_inv | 2,217,507 | 12,077.9 | 9,911.8 | 0.0 | 4,526.0 | 9,315.9 | 16,943.5 | 63,296.9 | Float64 | Payments received to date for portion of total amount funded by investors |
| total_rec_prncp | 2,217,507 | 9,501.8 | 8,343.2 | 0.0 | 3,000.0 | 7,000.0 | 13,900.2 | 40,000.0 | Float64 | Principal received to date |
| total_rec_int | 2,217,507 | 2,434.8 | 2,681.2 | 0.0 | 729.7 | 1,530.0 | 3,113.3 | 28,192.5 | Float64 | Interest received to date |
| total_rec_late_fee | 2,217,507 | 1.5 | 11.9 | -0.0 | 0.0 | 0.0 | 0.0 | 1,484.3 | Float64 | Late fees received to date |
| recoveries | 2,217,507 | 144.7 | 748.4 | 0.0 | 0.0 | 0.0 | 0.0 | 39,859.6 | Float64 | post charge off gross recovery |
| collection_recovery_fee | 2,217,507 | 24.2 | 130.6 | 0.0 | 0.0 | 0.0 | 0.0 | 7,174.7 | Float64 | post charge off collection fee |
| last_pymnt_amnt | 2,217,507 | 3,445.2 | 6,044.5 | 0.0 | 311.4 | 602.0 | 3,762.7 | 42,192.1 | Float64 | Last total payment amount received |
| last_fico_range_high | 2,217,507 | 687.6 | 72.8 | 0.0 | 654.0 | 699.0 | 734.0 | 850.0 | Int64 | The upper boundary range the borrower’s last FICO pulled belongs to. |
| last_fico_range_low | 2,217,507 | 675.5 | 110.9 | 0.0 | 650.0 | 695.0 | 730.0 | 845.0 | Int64 | The lower boundary range the borrower’s last FICO pulled belongs to. |
| collections_12_mths_ex_med | 2,217,507 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | Int64 | Number of collections in 12 months excluding medical collections |
| mths_since_last_major_derog | 580,559 | 44.2 | 21.5 | 0.0 | 27.0 | 44.0 | 62.0 | 226.0 | Int64 | Months since most recent 90-day or worse rating |
| annual_inc_joint | 120,664 | 123,628.2 | 74,170.3 | 5,693.5 | 83,400.0 | 110,000.0 | 148,000.0 | 7,874,821.0 | Float64 | The combined self-reported annual income provided by the co-borrowers during registration |
| dti_joint | 120,660 | 19.3 | 7.8 | 0.0 | 13.5 | 18.8 | 24.6 | 69.5 | Float64 | A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self-reported monthly income |
| acc_now_delinq | 2,217,507 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 14.0 | Int64 | The number of accounts on which the borrower is now delinquent. |
| tot_coll_amt | 2,189,774 | 232.8 | 8,519.6 | 0.0 | 0.0 | 0.0 | 0.0 | 9,152,545.0 | Float64 | Total collection amounts ever owed |
| tot_cur_bal | 2,189,774 | 142,484.6 | 160,690.5 | 0.0 | 29,090.0 | 79,232.5 | 213,190.0 | 9,971,659.0 | Float64 | Total current balance of all accounts |
| open_acc_6m | 1,394,160 | 0.9 | 1.1 | 0.0 | 0.0 | 1.0 | 1.0 | 18.0 | Int64 | Number of open trades in last 6 months |
| open_act_il | 1,394,161 | 2.8 | 3.0 | 0.0 | 1.0 | 2.0 | 3.0 | 57.0 | Int64 | Number of currently active installment trades |
| open_il_12m | 1,394,161 | 0.7 | 0.9 | 0.0 | 0.0 | 0.0 | 1.0 | 25.0 | Int64 | Number of installment accounts opened in past 12 months |
| open_il_24m | 1,394,161 | 1.6 | 1.6 | 0.0 | 0.0 | 1.0 | 2.0 | 51.0 | Int64 | Number of installment accounts opened in past 24 months |
| mths_since_rcnt_il | 1,350,371 | 21.2 | 26.1 | 0.0 | 7.0 | 13.0 | 24.0 | 511.0 | Int64 | Months since most recent installment accounts opened |
| total_bal_il | 1,394,161 | 35,502.6 | 44,097.4 | 0.0 | 8,693.0 | 23,123.0 | 46,088.0 | 1,837,038.0 | Float64 | Total current balance of all installment accounts |
| il_util | 1,191,472 | 69.1 | 23.7 | 0.0 | 55.0 | 72.0 | 86.0 | 1,000.0 | Float64 | Ratio of total current balance to high credit/credit limit on all install acct |
| open_rv_12m | 1,394,161 | 1.3 | 1.5 | 0.0 | 0.0 | 1.0 | 2.0 | 28.0 | Int64 | Number of revolving trades opened in past 12 months |
| open_rv_24m | 1,394,161 | 2.7 | 2.6 | 0.0 | 1.0 | 2.0 | 4.0 | 60.0 | Int64 | Number of revolving trades opened in past 24 months |
| max_bal_bc | 1,394,161 | 5,806.5 | 5,690.9 | 0.0 | 2,284.0 | 4,413.0 | 7,598.0 | 1,170,668.0 | Float64 | Maximum current balance owed on all revolving accounts |
| all_util | 1,393,942 | 57.0 | 20.9 | 0.0 | 43.0 | 58.0 | 72.0 | 239.0 | Float64 | Balance to credit limit on all trades |
| total_rev_hi_lim | 2,189,774 | 34,575.0 | 36,730.5 | 0.0 | 14,700.0 | 25,400.0 | 43,200.0 | 9,999,999.0 | Float64 | Total revolving high credit/credit limit |
| inq_fi | 1,394,161 | 1.0 | 1.5 | 0.0 | 0.0 | 1.0 | 1.0 | 48.0 | Int64 | Number of personal finance inquiries |
| total_cu_tl | 1,394,160 | 1.5 | 2.7 | 0.0 | 0.0 | 0.0 | 2.0 | 111.0 | Int64 | Number of finance trades |
| inq_last_12m | 1,394,160 | 2.0 | 2.4 | 0.0 | 0.0 | 1.0 | 3.0 | 67.0 | Int64 | Number of credit inquiries in past 12 months |
| acc_open_past_24mths | 2,210,013 | 4.5 | 3.2 | 0.0 | 2.0 | 4.0 | 6.0 | 64.0 | Int64 | Number of trades opened in past 24 months. |
| avg_cur_bal | 2,189,704 | 13,547.5 | 16,474.3 | 0.0 | 3,080.0 | 7,335.0 | 18,782.0 | 958,084.0 | Float64 | Average current balance of all accounts |
| bc_open_to_buy | 2,185,120 | 11,395.4 | 16,600.7 | 0.0 | 1,722.0 | 5,443.0 | 14,189.0 | 711,140.0 | Float64 | Total open to buy on revolving bankcards. |
| bc_util | 2,183,984 | 57.9 | 28.6 | 0.0 | 35.4 | 60.2 | 83.1 | 339.6 | Float64 | Ratio of total current balance to high credit/credit limit for all bankcard accounts. |
| chargeoff_within_12_mths | 2,217,507 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | Int64 | Number of charge-offs within 12 months |
| delinq_amnt | 2,217,507 | 12.6 | 733.3 | 0.0 | 0.0 | 0.0 | 0.0 | 249,925.0 | Float64 | The past-due amount owed for the accounts on which the borrower is now delinquent. |
| mo_sin_old_il_acct | 2,120,990 | 125.7 | 53.4 | 0.0 | 96.0 | 130.0 | 154.0 | 999.0 | Int64 | Months since oldest bank installment account opened |
| mo_sin_old_rev_tl_op | 2,189,773 | 181.5 | 97.1 | 1.0 | 116.0 | 164.0 | 232.0 | 999.0 | Int64 | Months since oldest revolving account opened |
| mo_sin_rcnt_rev_tl_op | 2,189,773 | 14.0 | 17.5 | 0.0 | 4.0 | 8.0 | 17.0 | 547.0 | Int64 | Months since most recent revolving account opened |
| mo_sin_rcnt_tl | 2,189,774 | 8.3 | 9.2 | 0.0 | 3.0 | 6.0 | 11.0 | 382.0 | Int64 | Months since most recent account opened |
| mort_acc | 2,210,013 | 1.6 | 1.9 | 0.0 | 0.0 | 1.0 | 3.0 | 94.0 | Int64 | Number of mortgage accounts. |
| mths_since_recent_bc | 2,186,642 | 24.8 | 32.3 | 0.0 | 6.0 | 14.0 | 30.0 | 661.0 | Int64 | Months since most recent bankcard account opened. |
| mths_since_recent_bc_dlq | 519,512 | 39.3 | 22.6 | 0.0 | 21.0 | 37.0 | 57.0 | 202.0 | Int64 | Months since most recent bankcard delinquency |
| mths_since_recent_inq | 1,964,655 | 7.0 | 6.0 | 0.0 | 2.0 | 5.0 | 11.0 | 25.0 | Int64 | Months since most recent inquiry. |
| mths_since_recent_revol_delinq | 740,121 | 35.8 | 22.3 | 0.0 | 17.0 | 33.0 | 51.0 | 202.0 | Int64 | Months since most recent revolving delinquency. |
| num_accts_ever_120_pd | 2,189,774 | 0.5 | 1.4 | 0.0 | 0.0 | 0.0 | 0.0 | 58.0 | Int64 | Number of accounts ever 120 or more days past due |
| num_actv_bc_tl | 2,189,774 | 3.7 | 2.3 | 0.0 | 2.0 | 3.0 | 5.0 | 50.0 | Int64 | Number of currently active bankcard accounts |
| num_actv_rev_tl | 2,189,774 | 5.6 | 3.4 | 0.0 | 3.0 | 5.0 | 7.0 | 72.0 | Int64 | Number of currently active revolving trades |
| num_bc_sats | 2,201,457 | 4.8 | 3.0 | 0.0 | 3.0 | 4.0 | 6.0 | 71.0 | Int64 | Number of satisfactory bankcard accounts |
| num_bc_tl | 2,189,774 | 7.7 | 4.7 | 0.0 | 4.0 | 7.0 | 10.0 | 86.0 | Int64 | Number of bankcard accounts |
| num_il_tl | 2,189,774 | 8.4 | 7.4 | 0.0 | 3.0 | 6.0 | 11.0 | 159.0 | Int64 | Number of installment accounts |
| num_op_rev_tl | 2,189,774 | 8.2 | 4.7 | 0.0 | 5.0 | 7.0 | 10.0 | 91.0 | Int64 | Number of open revolving accounts |
| num_rev_accts | 2,189,773 | 14.0 | 8.0 | 0.0 | 8.0 | 12.0 | 18.0 | 151.0 | Int64 | Number of revolving accounts |
| num_rev_tl_bal_gt_0 | 2,189,774 | 5.6 | 3.3 | 0.0 | 3.0 | 5.0 | 7.0 | 65.0 | Int64 | Number of revolving trades with balance >0 |
| num_sats | 2,201,457 | 11.6 | 5.6 | 0.0 | 8.0 | 11.0 | 14.0 | 101.0 | Int64 | Number of satisfactory accounts |
| num_tl_120dpd_2m | 2,106,407 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | Int64 | Number of accounts currently 120 days past due (updated in past 2 months) |
| num_tl_30dpd | 2,189,774 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | Int64 | Number of accounts currently 30 days past due (updated in past 2 months) |
| num_tl_90g_dpd_24m | 2,189,774 | 0.1 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 58.0 | Int64 | Number of accounts 90 or more days past due in last 24 months |
| num_tl_op_past_12m | 2,189,774 | 2.1 | 1.8 | 0.0 | 1.0 | 2.0 | 3.0 | 32.0 | Int64 | Number of accounts opened in past 12 months |
| pct_tl_nvr_dlq | 2,189,619 | 94.1 | 9.0 | 0.0 | 91.3 | 100.0 | 100.0 | 100.0 | Float64 | Percent of trades never delinquent |
| percent_bc_gt_75 | 2,184,676 | 42.4 | 36.2 | 0.0 | 0.0 | 37.5 | 71.4 | 100.0 | Float64 | Percentage of all bankcard accounts > 75% of limit. |
| pub_rec_bankruptcies | 2,217,507 | 0.1 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 | Int64 | Number of public record bankruptcies |
| tax_liens | 2,217,507 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 85.0 | Int64 | Number of tax liens |
| tot_hi_cred_lim | 2,189,774 | 178,236.7 | 181,574.2 | 0.0 | 50,729.0 | 114,292.0 | 257,743.0 | 9,999,999.0 | Float64 | Total high credit/credit limit |
| total_bal_ex_mort | 2,210,013 | 51,019.2 | 49,911.3 | 0.0 | 20,890.0 | 37,861.0 | 64,343.0 | 3,408,095.0 | Float64 | Total credit balance excluding mortgage |
| total_bc_limit | 2,210,013 | 23,195.1 | 23,007.8 | 0.0 | 8,300.0 | 16,300.0 | 30,300.0 | 1,569,000.0 | Float64 | Total bankcard high credit/credit limit |
| total_il_high_credit_limit | 2,189,774 | 43,728.1 | 45,072.3 | 0.0 | 15,000.0 | 32,692.0 | 58,798.0 | 2,118,996.0 | Float64 | Total installment high credit/credit limit |
| revol_bal_joint | 107,982 | 33,616.7 | 28,154.7 | 0.0 | 15,107.0 | 26,515.0 | 43,765.8 | 1,110,019.0 | Float64 | Sum of revolving credit balance of the co-borrowers, net of duplicate balances |
| sec_app_fico_range_low | 107,983 | 669.8 | 44.7 | 540.0 | 645.0 | 670.0 | 695.0 | 845.0 | Int64 | FICO range (high) for the secondary applicant |
| sec_app_fico_range_high | 107,983 | 673.8 | 44.7 | 544.0 | 649.0 | 674.0 | 699.0 | 850.0 | Int64 | FICO range (low) for the secondary applicant |
| sec_app_inq_last_6mths | 107,983 | 0.6 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | Int64 | Credit inquiries in the last 6 months at time of application for the secondary applicant |
| sec_app_mort_acc | 107,983 | 1.5 | 1.8 | 0.0 | 0.0 | 1.0 | 2.0 | 27.0 | Int64 | Number of mortgage accounts at time of application for the secondary applicant |
| sec_app_open_acc | 107,983 | 11.5 | 6.6 | 0.0 | 7.0 | 10.0 | 15.0 | 82.0 | Int64 | Number of open trades at time of application for the secondary applicant |
| sec_app_revol_util | 106,147 | 58.2 | 25.5 | 0.0 | 39.8 | 60.2 | 78.6 | 434.3 | Float64 | Ratio of total current balance to high credit/credit limit for all revolving accounts |
| sec_app_open_act_il | 107,983 | 3.0 | 3.3 | 0.0 | 1.0 | 2.0 | 4.0 | 43.0 | Int64 | Number of currently active installment trades at time of application for the secondary applicant |
| sec_app_num_rev_accts | 107,983 | 12.5 | 8.2 | 0.0 | 7.0 | 11.0 | 17.0 | 106.0 | Int64 | Number of revolving accounts at time of application for the secondary applicant |
| sec_app_chargeoff_within_12_mths | 107,983 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 21.0 | Int64 | Number of charge-offs within last 12 months at time of application for the secondary applicant |
| sec_app_collections_12_mths_ex_med | 107,983 | 0.1 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 23.0 | Int64 | Number of collections within last 12 months excluding medical collections at time of application for the secondary applicant |
| sec_app_mths_since_last_major_derog | 35,925 | 36.9 | 23.9 | 0.0 | 16.0 | 36.0 | 56.0 | 185.0 | Int64 | Months since most recent 90-day or worse rating at time of application for the secondary applicant |
| hardship_amount | 10,913 | 155.0 | 129.0 | 0.6 | 59.4 | 119.1 | 213.2 | 943.9 | Float64 | The interest payment that the borrower has committed to make each month while they are on a hardship plan |
| hardship_dpd | 10,913 | 13.7 | 9.7 | 0.0 | 5.0 | 15.0 | 22.0 | 37.0 | Int64 | Account days past due as of the hardship plan start date |
| orig_projected_additional_accrued_interest | 8,648 | 454.7 | 375.3 | 1.9 | 175.2 | 352.7 | 620.1 | 2,680.9 | Float64 | The original projected additional interest amount that will accrue for the given hardship payment plan as of the Hardship Start Date. This field will be null if the borrower has broken their hardship payment plan. |
| hardship_payoff_balance_amount | 10,913 | 11,634.9 | 7,624.4 | 55.7 | 5,627.0 | 10,028.4 | 16,143.4 | 40,306.4 | Float64 | The payoff balance amount as of the hardship plan start date |
| hardship_last_payment_amount | 10,913 | 194.0 | 198.6 | 0.0 | 44.4 | 133.2 | 284.2 | 1,407.9 | Float64 | The last payment amount as of the hardship plan start date |
| settlement_amount | 34,083 | 5,014.2 | 3,695.4 | 44.2 | 2,209.0 | 4,148.0 | 6,856.0 | 33,601.0 | Float64 | The loan amount that the borrower has agreed to settle for |
| settlement_percentage | 34,083 | 47.8 | 7.2 | 0.2 | 45.0 | 45.0 | 50.0 | 521.4 | Float64 | The settlement amount as a percentage of the payoff balance amount on the loan |
| settlement_term | 34,083 | 13.2 | 8.1 | 0.0 | 6.0 | 14.0 | 18.0 | 181.0 | Int64 | The number of months that the borrower will be on the settlement plan |
The number of features of type string is 30.
| count | unique | top | freq | data type | description | |
|---|---|---|---|---|---|---|
| id | 2,217,507 | 2,217,507 | 68407277 | 1 | string | A unique LC assigned ID for the loan listing. |
| grade | 2,217,507 | 7 | B | 651,022 | string | LC assigned loan grade |
| sub_grade | 2,217,507 | 35 | C1 | 143,589 | string | LC assigned loan subgrade |
| emp_title | 2,053,170 | 489,609 | Teacher | 38,818 | string | The job title supplied by the Borrower when applying for the loan.* |
| emp_length | 2,071,717 | 11 | 10+ years | 738,438 | string | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. |
| home_ownership | 2,217,507 | 6 | MORTGAGE | 1,092,198 | string | The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER |
| verification_status | 2,217,507 | 3 | Source Verified | 875,634 | string | Indicates if income was verified by LC, not verified, or if the income source was verified |
| issue_d | 2,217,507 | 84 | 2016-03 | 61,969 | string | The month which the loan was funded |
| loan_status | 2,217,507 | 7 | Fully Paid | 1,042,311 | string | Current status of the loan |
| purpose | 2,217,507 | 14 | debt_consolidation | 1,257,717 | string | A category provided by the borrower for the loan request. |
| zip_code | 2,217,506 | 954 | 112xx | 23,362 | string | The first 3 numbers of the zip code provided by the borrower in the loan application. |
| addr_state | 2,217,507 | 51 | CA | 307,042 | string | The state provided by the borrower in the loan application |
| earliest_cr_line | 2,217,507 | 751 | 2004-09 | 15,172 | string | The month the borrower's earliest reported credit line was opened |
| initial_list_status | 2,217,507 | 2 | w | 1,535,014 | string | The initial listing status of the loan. Possible values are – W, F |
| last_pymnt_d | 2,215,163 | 86 | 2019-03 | 852,741 | string | Last month payment was received |
| next_pymnt_d | 912,333 | 8 | 2019-04 | 911,945 | string | Next scheduled payment date |
| last_credit_pull_d | 2,217,439 | 88 | 2019-03 | 1,362,420 | string | The most recent month LC pulled credit for this loan |
| application_type | 2,217,507 | 2 | Individual | 2,096,843 | string | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
| verification_status_joint | 115,685 | 3 | Not Verified | 57,377 | string | Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified |
| sec_app_earliest_cr_line | 107,983 | 663 | 2006-08 | 998 | string | Earliest credit line at time of application for the secondary applicant |
| hardship_reason | 10,913 | 9 | NATURAL_DISASTER | 2,965 | string | Describes the reason the hardship plan was offered |
| hardship_status | 10,913 | 3 | COMPLETED | 7,816 | string | Describes if the hardship plan is active, pending, canceled, completed, or broken |
| hardship_start_date | 10,913 | 27 | 2017-09 | 2,443 | string | The start date of the hardship plan period |
| hardship_end_date | 10,913 | 28 | 2017-12 | 1,755 | string | The end date of the hardship plan period |
| payment_plan_start_date | 10,913 | 27 | 2017-09 | 1,714 | string | The day the first hardship plan payment is due. For example, if a borrower has a hardship plan period of 3 months, the start date is the start of the three-month period in which the borrower is allowed to make interest-only payments. |
| hardship_loan_status | 10,913 | 5 | Late (16-30 days) | 4,767 | string | Loan Status as of the hardship plan start date |
| disbursement_method | 2,217,507 | 2 | Cash | 2,139,393 | string | The method by which the borrower receives their loan. Possible values are: CASH, DIRECT_PAY |
| debt_settlement_flag_date | 34,083 | 70 | 2019-02 | 2,605 | string | The most recent date that the Debt_Settlement_Flag has been set |
| settlement_status | 34,083 | 3 | ACTIVE | 14,701 | string | The status of the borrower’s settlement plan. Possible values are: COMPLETE, ACTIVE, BROKEN, CANCELLED, DENIED, DRAFT |
| settlement_date | 34,083 | 75 | 2019-01 | 1,710 | string | The date that the borrower agrees to the settlement plan |
The number of features of type boolean is 3.
| count | unique | top | freq | data type | description | |
|---|---|---|---|---|---|---|
| pymnt_plan | 2,217,507 | 2 | False | 2,216,887 | boolean | Indicates if a payment plan has been put in place for the loan |
| hardship_flag | 2,217,507 | 2 | False | 2,216,675 | boolean | Flags whether or not the borrower is on a hardship plan |
| debt_settlement_flag | 2,217,507 | 2 | False | 2,183,424 | boolean | Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company. |
Plots of distributions¶
When working in R markdown, I have often used a loop to generate plots showing the distributions of all features. The plots function as a complement to summary tables such as those shown in the previous section of the notebook.
When I tried to use this workflow while plotting with seaborn in Jupyter, I found that memory leakage caused the kernel to die, even when I took care to ensure that all figures where closed at the end of each loop iteration. For a few years there have been posts about memory leakage associated with using matplotlib to generate figures in a loop within Jupyter, and it appears that the only reliable solution is to use a noninteractive backend for matplotlib.
To apply this workaround, each iteration of a loop that generates plots does the following:
- Use seaborn to generate a figure.
- Write the figure to disk in SVG format.
- Load the SVG figure from disk as a string.
- Pass the SVG string to functions from
IPython.displayto add the figure to the notebook.
These steps are encapsulated in the function display_svg_plot defined below.
matplotlib.use("svg")
sns.set_theme()
def display_svg_plot(plot):
temp_svg_path = Path.cwd() / "temp.svg"
plot.savefig(temp_svg_path, format="svg")
svg_string = temp_svg_path.read_text()
display(SVG(data=svg_string))
plt.close("all")
temp_svg_path.unlink()
Certain columns contain categorical data with a large number of possible values. Exclude these columns from the loop that generates plots of distribution.
do_not_plot = ["id", "emp_title", "zip_code"]
Define which type of plot to generate for each feature.
line_plots = [
"issue_d",
"earliest_cr_line",
"last_pymnt_d",
"next_pymnt_d",
"last_credit_pull_d",
"sec_app_earliest_cr_line",
"hardship_start_date",
"hardship_end_date",
"payment_plan_start_date",
"debt_settlement_flag_date",
"settlement_date",
]
boolean_plots = ["pymnt_plan", "hardship_flag", "debt_settlement_flag"]
logscale_histograms = [
"installment",
"annual_inc",
"dti",
"revol_bal",
"revol_util",
"out_prncp",
"out_prncp_inv",
"total_pymnt",
"total_pymnt_inv",
"total_rec_int",
"last_pymnt_amnt",
"annual_inc_joint",
"tot_cur_bal",
"mths_since_rcnt_il",
"total_bal_il",
"il_util",
"max_bal_bc",
"total_rev_hi_lim",
"avg_cur_bal",
"bc_open_to_buy",
"bc_util",
"mo_sin_old_il_acct",
"mo_sin_old_rev_tl_op",
"mo_sin_rcnt_rev_tl_op",
"mo_sin_rcnt_tl",
"mths_since_recent_bc",
"tot_hi_cred_lim",
"total_bal_ex_mort",
"total_bc_limit",
"total_il_high_credit_limit",
"revol_bal_joint",
"settlement_percentage",
]
linear_histograms = [
column
for column in loan_data
if column not in (do_not_plot + line_plots + boolean_plots + logscale_histograms)
]
Define variables used in plot customization.
custom_binwidth = {
"fico_range_low": 5,
"fico_range_high": 5,
"last_fico_range_low": 5,
"last_fico_range_high": 5,
"all_util": 2,
"sec_app_fico_range_low": 5,
"sec_app_fico_range_high": 5,
"pct_tl_nvr_dlq": 1,
}
percent_formatting = [
"pct_tl_nvr_dlq",
"percent_bc_gt_75",
"settlement_percentage",
]
angled_labels = [
"purpose",
"hardship_reason",
]
custom_ordered_categories = {
"emp_length": [
"< 1 year",
"1 year",
"2 years",
"3 years",
"4 years",
"5 years",
"6 years",
"7 years",
"8 years",
"9 years",
"10+ years",
]
}
def plot_distribution(data, column_name, title, caption, **kwargs):
column_data = data[column_name]
# If the values are strings, then set the order in which labels will appear along
# the x-axis.
if is_string_dtype(column_data.dtype):
column_data = _convert_to_ordered_categorical(column_data, column_name)
plot = sns.displot(x=column_data, **kwargs).set(title=title)
# Tweak tick locations and labels.
ax = plot.facet_axis(0, 0)
ax.yaxis.set_major_formatter(FuncFormatter(format_counts))
if column in percent_formatting:
ax.xaxis.set_major_formatter(PercentFormatter(decimals=0))
elif is_numeric_dtype(column_data.dtype) and "log_scale" not in kwargs:
# Force tick marks to be located on integers. The call below is equivalent to
# using matplotlib.ticker.AutoLocator with the constraint that tick marks are
# integers.
ax.xaxis.set_major_locator(
MaxNLocator(nbins="auto", steps=[1, 2, 2.5, 5, 10], integer=True)
)
# Add comma separators in formatting tick labels for the x-axis.
ax.xaxis.set_major_formatter(
FuncFormatter(
lambda tick_value, _tick_position: format(int(tick_value), ",")
)
)
if column in angled_labels:
# The steps to rotate tick labels for the x-axis using matplotlib's
# object-oriented approach are clumsy, so use pyplot instead.
plt.xticks(rotation=-45, ha="left")
display_svg_plot(plot)
print(f"\n{caption}\n\n\n", flush=True)
def _convert_to_ordered_categorical(col_data, col_name):
if col_name in custom_ordered_categories:
ordered_categories = custom_ordered_categories[col_name]
else:
ordered_categories = Series(col_data.unique()).dropna().sort_values()
categorical_data = Categorical(
col_data, categories=ordered_categories, ordered=True
)
categorical_data.name = col_data.name
return categorical_data
def plot_date_distribution(data, column_name, title, caption, **kwargs):
to_plot = get_group_sizes(data, group_by=column_name).rename(
{"count": "Count"}, axis="columns"
)
to_plot[column_name] = to_datetime(to_plot[column_name], format="ISO8601")
kwargs["kind"] = "line"
if len(to_plot[column_name].index) <= 150:
kwargs["marker"] = "o"
plot = sns.relplot(data=to_plot, x=column_name, y="Count", **kwargs).set(
title=title
)
# Tweak formatting of tick labels for the y-axis..
ax = plot.facet_axis(0, 0)
ax.yaxis.set_major_formatter(FuncFormatter(format_counts))
display_svg_plot(plot)
print(f"\n{caption}\n\n\n", flush=True)
for column in linear_histograms:
caption = get_wrapped_caption(column, feat_desc, width=200)
kwargs = {"aspect": 3.3}
if column in custom_binwidth:
kwargs["binwidth"] = custom_binwidth[column]
elif is_integer_dtype(loan_data[column].dtype):
kwargs["discrete"] = True
plot_distribution(loan_data, column, f"Distribution of {column}", caption, **kwargs)
loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
funded_amnt: The total amount committed to that loan at that point in time.
funded_amnt_inv: The total amount committed by investors for that loan at that point in time.
term: The number of payments on the loan. Values are in months and can be either 36 or 60.
int_rate: Interest Rate on the loan
grade: LC assigned loan grade
sub_grade: LC assigned loan subgrade
emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified
loan_status: Current status of the loan
purpose: A category provided by the borrower for the loan request.
addr_state: The state provided by the borrower in the loan application
delinq_2yrs: The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
fico_range_low: The lower boundary range the borrower’s FICO at loan origination belongs to.
fico_range_high: The upper boundary range the borrower’s FICO at loan origination belongs to.
inq_last_6mths: The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
mths_since_last_delinq: The number of months since the borrower's last delinquency.
mths_since_last_record: The number of months since the last public record.
open_acc: The number of open credit lines in the borrower's credit file.
pub_rec: Number of derogatory public records
total_acc: The total number of credit lines currently in the borrower's credit file
initial_list_status: The initial listing status of the loan. Possible values are – W, F
total_rec_prncp: Principal received to date
total_rec_late_fee: Late fees received to date
recoveries: post charge off gross recovery
collection_recovery_fee: post charge off collection fee
last_fico_range_high: The upper boundary range the borrower’s last FICO pulled belongs to.
last_fico_range_low: The lower boundary range the borrower’s last FICO pulled belongs to.
collections_12_mths_ex_med: Number of collections in 12 months excluding medical collections
mths_since_last_major_derog: Months since most recent 90-day or worse rating
application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers
dti_joint: A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self- reported monthly income
verification_status_joint: Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified
acc_now_delinq: The number of accounts on which the borrower is now delinquent.
tot_coll_amt: Total collection amounts ever owed
open_acc_6m: Number of open trades in last 6 months
open_act_il: Number of currently active installment trades
open_il_12m: Number of installment accounts opened in past 12 months
open_il_24m: Number of installment accounts opened in past 24 months
open_rv_12m: Number of revolving trades opened in past 12 months
open_rv_24m: Number of revolving trades opened in past 24 months
all_util: Balance to credit limit on all trades
inq_fi: Number of personal finance inquiries
total_cu_tl: Number of finance trades
inq_last_12m: Number of credit inquiries in past 12 months
acc_open_past_24mths: Number of trades opened in past 24 months.
chargeoff_within_12_mths: Number of charge-offs within 12 months
delinq_amnt: The past-due amount owed for the accounts on which the borrower is now delinquent.
mort_acc: Number of mortgage accounts.
mths_since_recent_bc_dlq: Months since most recent bankcard delinquency
mths_since_recent_inq: Months since most recent inquiry.
mths_since_recent_revol_delinq: Months since most recent revolving delinquency.
num_accts_ever_120_pd: Number of accounts ever 120 or more days past due
num_actv_bc_tl: Number of currently active bankcard accounts
num_actv_rev_tl: Number of currently active revolving trades
num_bc_sats: Number of satisfactory bankcard accounts
num_bc_tl: Number of bankcard accounts
num_il_tl: Number of installment accounts
num_op_rev_tl: Number of open revolving accounts
num_rev_accts: Number of revolving accounts
num_rev_tl_bal_gt_0: Number of revolving trades with balance >0
num_sats: Number of satisfactory accounts
num_tl_120dpd_2m: Number of accounts currently 120 days past due (updated in past 2 months)
num_tl_30dpd: Number of accounts currently 30 days past due (updated in past 2 months)
num_tl_90g_dpd_24m: Number of accounts 90 or more days past due in last 24 months
num_tl_op_past_12m: Number of accounts opened in past 12 months
pct_tl_nvr_dlq: Percent of trades never delinquent
percent_bc_gt_75: Percentage of all bankcard accounts > 75% of limit.
pub_rec_bankruptcies: Number of public record bankruptcies
tax_liens: Number of tax liens
sec_app_fico_range_low: FICO range (high) for the secondary applicant
sec_app_fico_range_high: FICO range (low) for the secondary applicant
sec_app_inq_last_6mths: Credit inquiries in the last 6 months at time of application for the secondary applicant
sec_app_mort_acc: Number of mortgage accounts at time of application for the secondary applicant
sec_app_open_acc: Number of open trades at time of application for the secondary applicant
sec_app_revol_util: Ratio of total current balance to high credit/credit limit for all revolving accounts
sec_app_open_act_il: Number of currently active installment trades at time of application for the secondary applicant
sec_app_num_rev_accts: Number of revolving accounts at time of application for the secondary applicant
sec_app_chargeoff_within_12_mths: Number of charge-offs within last 12 months at time of application for the secondary applicant
sec_app_collections_12_mths_ex_med: Number of collections within last 12 months excluding medical collections at time of application for the secondary applicant
sec_app_mths_since_last_major_derog: Months since most recent 90-day or worse rating at time of application for the secondary applicant
hardship_reason: Describes the reason the hardship plan was offered
hardship_status: Describes if the hardship plan is active, pending, canceled, completed, or broken
hardship_amount: The interest payment that the borrower has committed to make each month while they are on a hardship plan
hardship_dpd: Account days past due as of the hardship plan start date
hardship_loan_status: Loan Status as of the hardship plan start date
orig_projected_additional_accrued_interest: The original projected additional interest amount that will accrue for the given hardship payment plan as of the Hardship Start Date. This field will be null if the borrower has broken their hardship payment plan.
hardship_payoff_balance_amount: The payoff balance amount as of the hardship plan start date
hardship_last_payment_amount: The last payment amount as of the hardship plan start date
disbursement_method: The method by which the borrower receives their loan. Possible values are: CASH, DIRECT_PAY
settlement_status: The status of the borrower’s settlement plan. Possible values are: COMPLETE, ACTIVE, BROKEN, CANCELLED, DENIED, DRAFT
settlement_amount: The loan amount that the borrower has agreed to settle for
settlement_term: The number of months that the borrower will be on the settlement plan
for column in logscale_histograms:
caption = get_wrapped_caption(column, feat_desc, width=200)
kwargs = {"aspect": 3.3, "log_scale": True, "bins": 50}
plot_distribution(loan_data, column, f"Distribution of {column}", caption, **kwargs)
installment: The monthly payment owed by the borrower if the loan originates.
annual_inc: The self-reported annual income provided by the borrower during registration.
dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
revol_bal: Total credit revolving balance
revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
out_prncp: Remaining outstanding principal for total amount funded
out_prncp_inv: Remaining outstanding principal for portion of total amount funded by investors
total_pymnt: Payments received to date for total amount funded
total_pymnt_inv: Payments received to date for portion of total amount funded by investors
total_rec_int: Interest received to date
last_pymnt_amnt: Last total payment amount received
annual_inc_joint: The combined self-reported annual income provided by the co-borrowers during registration
tot_cur_bal: Total current balance of all accounts
mths_since_rcnt_il: Months since most recent installment accounts opened
total_bal_il: Total current balance of all installment accounts
il_util: Ratio of total current balance to high credit/credit limit on all install acct
max_bal_bc: Maximum current balance owed on all revolving accounts
total_rev_hi_lim: Total revolving high credit/credit limit
avg_cur_bal: Average current balance of all accounts
bc_open_to_buy: Total open to buy on revolving bankcards.
bc_util: Ratio of total current balance to high credit/credit limit for all bankcard accounts.
mo_sin_old_il_acct: Months since oldest bank installment account opened
mo_sin_old_rev_tl_op: Months since oldest revolving account opened
mo_sin_rcnt_rev_tl_op: Months since most recent revolving account opened
mo_sin_rcnt_tl: Months since most recent account opened
mths_since_recent_bc: Months since most recent bankcard account opened.
tot_hi_cred_lim: Total high credit/credit limit
total_bal_ex_mort: Total credit balance excluding mortgage
total_bc_limit: Total bankcard high credit/credit limit
total_il_high_credit_limit: Total installment high credit/credit limit
revol_bal_joint: Sum of revolving credit balance of the co-borrowers, net of duplicate balances
settlement_percentage: The settlement amount as a percentage of the payoff balance amount on the loan
for column in boolean_plots:
caption = get_wrapped_caption(column, feat_desc, width=200)
kwargs = {"aspect": 3.3}
mapper = {True: "True", False: "False"}
plot_distribution(
loan_data[column].map(mapper).to_frame(),
column,
f"Distribution of {column}",
caption,
**kwargs,
)
pymnt_plan: Indicates if a payment plan has been put in place for the loan
hardship_flag: Flags whether or not the borrower is on a hardship plan
debt_settlement_flag: Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company.
for column in line_plots:
caption = get_wrapped_caption(column, feat_desc, width=200)
kwargs = {"aspect": 3.3}
plot_date_distribution(
loan_data, column, f"Distribution of {column}", caption, **kwargs
)
issue_d: The month which the loan was funded
earliest_cr_line: The month the borrower's earliest reported credit line was opened
last_pymnt_d: Last month payment was received
next_pymnt_d: Next scheduled payment date
last_credit_pull_d: The most recent month LC pulled credit for this loan
sec_app_earliest_cr_line: Earliest credit line at time of application for the secondary applicant
hardship_start_date: The start date of the hardship plan period
hardship_end_date: The end date of the hardship plan period
payment_plan_start_date: The day the first hardship plan payment is due. For example, if a borrower has a hardship plan period of 3 months, the start date is the start of the three-month period in which the borrower is allowed to make interest-only payments.
debt_settlement_flag_date: The most recent date that the Debt_Settlement_Flag has been set
settlement_date: The date that the borrower agrees to the settlement plan